Load data

Question 1: Download the data, load it into python. Report the number of rows and columns that you've loaded.
Answer: The data has 1,510,722 rows (just over 1.5 million records) and 21 columns.
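A minimal sketch of the loading step, assuming the data is a CSV file read with pandas. The helper name `load_and_report`, the column names, and the tiny in-memory sample are illustrative assumptions, not the notebook's code; with the real file, pass its path instead of the sample.

```python
import io
import pandas as pd

def load_and_report(source):
    """Read a CSV (path or file-like object) and return the frame with its shape."""
    df = pd.read_csv(source)
    n_rows, n_cols = df.shape
    return df, n_rows, n_cols

# Tiny stand-in for the real file (hypothetical column names).
sample_csv = io.StringIO(
    "pickup_datetime,trip_distance\n"
    "2016-02-01 00:05:00,1.2\n"
    "2016-02-01 00:17:00,3.4\n"
)
df, n_rows, n_cols = load_and_report(sample_csv)
print(f"{n_rows} rows, {n_cols} columns")
```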

Visualize distance and time

Question 2: Visualize trip distance by time of day in any way you see fit, any observations?

Answer:

  1. From 'Plot 2-1: box plot of trip distance per hour' we can get a quick rough idea about the pattern of the trips per hour.

My guess as to why people get up early to take these trips: they may need to reach an airport or train station in the early morning. If so, the main stations should appear among the most popular drop-off locations for early-morning trips.

  1. From 'Plot 2-2A: sum and median distance by hour'
  1. From 'Plot 2-3A and Plot 2-3B', we can see the patterns even more clearly with a polar projection. By connecting 0 o'clock and 24 o'clock, the trip distance by time of day is shown in a more intuitive way.

Because we have not separated weekends from workdays, workdays clearly dominate the pattern of the sum of distance by hour. I can imagine a typical urban lifestyle: in the morning, around 8 am or 9 am, people who are running late may take a taxi to the office. After work, they may go to bars and restaurants by cab. Then they start to go home, and after midnight most people are not traveling, even in New York.

Summary of trip distance per hour

Sum, median and count per hour

From the plot below we can see that the highest peak of trip distance happens between 5 pm and 8 pm.
The second highest peak is between 8 am and 9 am.
From midnight to 7 am the sum of trip distance is clearly lower than at other times of the day.
Because we have not separated weekends from workdays, workdays clearly dominate the pattern here.
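The per-hour summary described above can be sketched with a pandas group-by. This is only an illustration: the column names `pickup_datetime` and `trip_distance` and the helper `distance_by_hour` are assumptions, and the three sample rows stand in for the real data.

```python
import pandas as pd

def distance_by_hour(df, time_col="pickup_datetime", dist_col="trip_distance"):
    """Return sum, median and trip count of distance for each hour of the day."""
    hours = pd.to_datetime(df[time_col]).dt.hour
    summary = df.groupby(hours)[dist_col].agg(["sum", "median", "count"])
    summary.index.name = "hour"
    return summary

# Synthetic stand-in rows (hypothetical column names).
trips = pd.DataFrame({
    "pickup_datetime": ["2016-02-01 08:05", "2016-02-01 08:40", "2016-02-02 18:10"],
    "trip_distance": [1.0, 3.0, 2.5],
})
summary = distance_by_hour(trips)
print(summary)
```

Hours with no trips are simply absent from the result; reindexing on `range(24)` would make the gaps explicit before plotting.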

Better visualization to show a whole day

From Plot 2-3A below we can see a clear and simple pattern in the median distance per trip per hour:
between 5 am and 6 am the median trip distance is significantly longer than during the rest of the day.
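To connect 0 o'clock and 24 o'clock on a polar plot, each hour is mapped to an angle and the first point is repeated at the end so the curve closes. The sketch below covers only that mapping; `to_polar` is a hypothetical helper and the input values are synthetic.

```python
import numpy as np

def to_polar(hourly_values):
    """Map 24 hourly values to angles and close the curve at 24:00 = 0:00."""
    r = np.asarray(hourly_values, dtype=float)
    if r.shape != (24,):
        raise ValueError("expected one value per hour (24 values)")
    theta = 2 * np.pi * np.arange(24) / 24
    # Repeat the first point at angle 2*pi so the line joins 23:00 back to 0:00.
    return np.append(theta, 2 * np.pi), np.append(r, r[0])

theta, r = to_polar(range(24))  # synthetic values, one per hour

# With matplotlib, the result could be drawn on a polar axis, e.g.:
#   ax = plt.subplot(projection="polar")
#   ax.plot(theta, r)
```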

Analyze the location

Question 3: What are the most popular pickup locations on weekends vs weekdays?

Answer:

  1. The pick-up locations in the dataset are given as longitude and latitude. If we simply group the data by exact coordinates (see Table 3-1A: weekdays and Table 3-1B: weekend), the most popular pick-up location is the same on weekdays and weekends: the Forest Hills-71 Av subway station at (40.7214, -73.8443). (Locations with 0 in the coordinates are ignored as outliers.)
  1. Grouping by exact coordinates may not be the best idea, since different exits of one large building or station can be hundreds of meters apart, yet they should count as one location for the analysis to be meaningful. After filtering out locations that are too far from the rest, the remaining locations all fall within a rectangle of approximately 16 km (W-E) by 30 km (S-N) on the map. My goal was to group the locations into blocks of roughly 250 m * 250 m. Using roughly 100 bins for both longitude and latitude gives blocks of about 160 m by 300 m. After binning the area into blocks, I made two heatmaps to highlight the blocks where more trips started (see Figure 3-2A: Weekday heatmap and Figure 3-2B: Weekend heatmap), extracted the coordinates of the highlighted blocks, and marked them on a base map in Figure 3-2C: Most popular pickup locations on weekdays vs weekends.

The most popular pickup location is the same for both weekends and weekdays, and it is still Forest Hills Station.

The top 3 locations for weekdays are:

The top 3 locations for weekends are:

This result shows that the popular pick-up locations do not differ much between weekdays and weekends. Train stations in Queens are always the busiest among the three central boroughs (Manhattan, Queens and Brooklyn). The only difference might be that on weekdays Marcus Garvey Park in Manhattan is more active, while on weekends it is beaten by McCarren Park in Brooklyn.

  1. Depending on the goal of the questions, it may be more interesting to see which boroughs are more active pick up zones.
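The coarse-grid grouping from step 2 above can be sketched with a 2D histogram: 100 bins over a 16 km by 30 km area give blocks of 16000 / 100 = 160 m by 30000 / 100 = 300 m. The helper `top_blocks` is hypothetical, the coordinates are synthetic, and the sketch assumes outliers have already been filtered out.

```python
import numpy as np

def top_blocks(lon, lat, bins=100, k=3):
    """Bin points into a bins x bins grid and return the k busiest blocks
    as (center latitude, center longitude, trip count), busiest first."""
    counts, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=bins)
    # Flat indices of the k largest cells, largest first.
    flat = np.argsort(counts, axis=None)[::-1][:k]
    i, j = np.unravel_index(flat, counts.shape)
    lon_c = (lon_edges[i] + lon_edges[i + 1]) / 2
    lat_c = (lat_edges[j] + lat_edges[j + 1]) / 2
    return [(float(a), float(b), int(c)) for a, b, c in zip(lat_c, lon_c, counts[i, j])]

# Synthetic pickups: one dense cluster, one smaller cluster, two corner points.
lon = [-73.8443] * 50 + [-73.95] * 10 + [-74.0, -73.8]
lat = [40.7214] * 50 + [40.80] * 10 + [40.6, 40.9]
blocks = top_blocks(lon, lat)
print(blocks[0])
```

The `counts` array is also what a heatmap would display; the block centers are the coordinates that can be marked on a base map.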

Group by exact coordinates. Table 3-1A weekdays and 3-1B weekend
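A sketch of how the weekday and weekend tables could be produced by grouping on exact coordinates. The column names and the helper `top_pickups` are assumptions for illustration, and the rows are synthetic. Rows with 0 in the coordinates are dropped, as described in the answer.

```python
import pandas as pd

def top_pickups(df, k=3):
    """Return the k most frequent exact pickup coordinates for weekdays and weekends."""
    # Ignore records with 0 in the coordinates (treated as outliers).
    df = df[(df["pickup_longitude"] != 0) & (df["pickup_latitude"] != 0)]
    # dayofweek: Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend.
    is_weekend = pd.to_datetime(df["pickup_datetime"]).dt.dayofweek >= 5
    result = {}
    for label, part in (("weekday", df[~is_weekend]), ("weekend", df[is_weekend])):
        counts = part.groupby(["pickup_latitude", "pickup_longitude"]).size()
        result[label] = counts.sort_values(ascending=False).head(k)
    return result

# Synthetic rows: 2016-02-01 is a Monday, 2016-02-06 a Saturday.
trips = pd.DataFrame({
    "pickup_datetime": ["2016-02-01 08:00", "2016-02-01 09:00", "2016-02-01 10:00",
                        "2016-02-01 11:00", "2016-02-06 12:00"],
    "pickup_latitude": [40.7214, 40.7214, 40.80, 0.0, 40.7214],
    "pickup_longitude": [-73.8443, -73.8443, -73.94, 0.0, -73.8443],
})
tables = top_pickups(trips)
print(tables["weekday"])
```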

Search by coarse resolution

Answer: The most popular pickup location is the same for both weekends and weekdays: Forest Hills Station.

The top 3 locations for weekdays are:

The top 3 locations for weekends are:

Heatmaps

location coordinates

mark on real map

Predict number of trips by hour

Question 4: Build a model to forecast the number of trips by hour for the next 12 hours after Feb 12th 10:00 am. How well did you do?

Answer: The best model was trained on a combination of data from 2015-02 and data from 2016-02-01 to 2016-02-12 10 am, with a mean absolute error (MAE) of 204.16 on the test data (Feb 12, 11 am to 10 pm).

  1. Problem description
    This is a typical univariate time series forecasting problem with many existing solutions. The first that comes to mind is the Seasonal AutoRegressive Integrated Moving Average (SARIMA) model, which has been widely used for time series forecasting for decades. Machine learning models such as LSTMs and causal ConvNets are also popular for complex or high-dimensional time series. However, 12 days of hourly data for a single variable is too little to train a neural network. Besides, the output of a SARIMA model is much easier to understand and evaluate.

  2. Differencing and stationarity Section 4.1
    Before building any models, I checked the stationarity and seasonality of the number of trips by hour over the whole month of Feb 2016. Plot 4-1: Num of trips by hour gives a good picture of what the time series looks like, and shows that the original series is not stationary. The Augmented Dickey-Fuller (ADF) test is a unit root test for stationarity. Its results show that after first-order differencing the series becomes stationary: the null hypothesis (there is a unit root in a univariate process) is rejected with p value < 0.000001.

  3. ACF and PACF Section 4.2.1
    The autocorrelation function (ACF) and partial autocorrelation function (PACF) of a single time series help us understand its temporal dynamics. The ACF correlogram is often used to estimate the order of a moving average (MA) model, and the PACF correlogram to estimate the order of an autoregressive (AR) model. Combining an AR and an MA model on a differenced series gives the AutoRegressive Integrated Moving Average (ARIMA) model. Taking seasonality into account as well gives a seasonal ARIMA (SARIMA) model. From the correlograms below I roughly selected a range for each of the parameters of a SARIMA model (p, d, q, P, D, Q, S); in the next step I used grid search to find the parameters that optimize the prediction.

  4. SARIMA model selection Section 4.2.2

  5. First model result evaluation and visualization Section 4.2.3 and Section 4.2.4
    Not surprisingly, the first model's predictions are not very good, but they show some potential. It has an MAE of 450.75, while the mean of the test data is 3435.5. This gives a weighted MAPE (MAE / mean) of 13%, which is too large.
  6. Second model result evaluation and visualization Section 4.3
    After adding the Feb 2015 data to the training set, the MAE decreased to 204.16 and the weighted MAPE dropped to 5.9%, which is much more promising given the small training set.
  7. Future steps:
    It is quite possible that more historical training data would keep improving the SARIMA model's accuracy. The other option is to try a different model. SARIMA models don't really support complex seasonality, but TBATS models do (exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components).

Overview: seasonality, trend, stationarity and differencing
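As a rough illustration of what differencing does, the sketch below applies first-order and seasonal (lag 24) differencing to a synthetic series with a linear trend and a daily cycle. The helper `difference` and the series are assumptions, not the notebook's data. A unit root test such as `adfuller` from `statsmodels.tsa.stattools` could then be run on the differenced output.

```python
import numpy as np

def difference(series, order=1, lag=1):
    """Apply `order` rounds of lag-`lag` differencing: y[t] - y[t - lag]."""
    y = np.asarray(series, dtype=float)
    for _ in range(order):
        y = y[lag:] - y[:-lag]
    return y

# Synthetic hourly series: linear trend plus a 24-hour cycle (4 days).
t = np.arange(96)
trips = 1000 + 5 * t + 200 * np.sin(2 * np.pi * t / 24)

d1 = difference(trips)           # first-order differencing removes the linear trend
d24 = difference(trips, lag=24)  # seasonal differencing removes the daily cycle
print(d24[:3])
```

In this toy series the lag-24 difference is constant (slope times 24), because the daily cycle cancels exactly; real trip counts will not be this clean.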

Seasonal ARIMA analysis

Though SARIMA doesn't support multiple seasonality modeling, it doesn't hurt to give it a try.

ACF and PACF
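To make the correlogram concrete, here is a small sketch of the sample ACF computed directly with NumPy on a synthetic series with a 24-hour cycle. The function `acf` is a hypothetical helper written for this illustration; it does not compute the PACF. For actual correlogram plots, statsmodels provides `plot_acf` and `plot_pacf` in `statsmodels.graphics.tsaplots`.

```python
import numpy as np

def acf(series, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    y = np.asarray(series, dtype=float)
    y = y - y.mean()
    denom = np.dot(y, y)
    return np.array([np.dot(y[: len(y) - k], y[k:]) / denom for k in range(nlags + 1)])

# Synthetic hourly series with a daily cycle (10 days).
t = np.arange(240)
r = acf(np.sin(2 * np.pi * t / 24), nlags=48)
print(r[12], r[24])
```

A strong positive spike at lag 24 and a negative one at lag 12 are the signature of a daily season, which is the kind of pattern that motivates a seasonal period S in the model.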

p d q P D Q selection
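The grid search over (p, d, q, P, D, Q) can be sketched as a loop over all combinations that keeps the best-scoring one. Everything here is illustrative: `grid_search` and `toy_score` are hypothetical, S = 24 is an assumption (hourly data with a daily cycle), and the scoring criterion is left open. In practice `score` would fit a model, for example statsmodels' `SARIMAX(train, order=order, seasonal_order=seasonal)`, and return an error or information criterion.

```python
from itertools import product

def grid_search(score, p_range, d_range, q_range, P_range, D_range, Q_range, S):
    """Evaluate score(order, seasonal_order) on every combination; lowest wins."""
    best = None
    for p, d, q, P, D, Q in product(p_range, d_range, q_range, P_range, D_range, Q_range):
        order, seasonal = (p, d, q), (P, D, Q, S)
        try:
            value = score(order, seasonal)
        except Exception:
            continue  # some combinations may fail to fit; skip them
        if best is None or value < best[0]:
            best = (value, order, seasonal)
    return best

def toy_score(order, seasonal):
    # Placeholder score: pretends (1, 1, 1) x (0, 1, 1, S) is the ideal setting.
    target = (1, 1, 1, 0, 1, 1)
    return sum(abs(a - b) for a, b in zip(order + seasonal[:3], target))

best = grid_search(toy_score, range(3), [1], range(3), range(2), [1], range(2), S=24)
print(best)
```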

S-ARIMA model interpretation

SARIMA model result visualization and evaluation
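The two evaluation metrics used in the answer, MAE and weighted MAPE (MAE divided by the mean of the actual values), are simple enough to write out. The function names are my own, and the short list of numbers is made up; the last two lines only redo the arithmetic on the figures reported in the answer.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def weighted_mape(actual, predicted):
    """MAE divided by the mean of the actual values."""
    return mae(actual, predicted) / (sum(actual) / len(actual))

# Made-up example values.
example = weighted_mape([100, 200, 300], [110, 190, 330])

# Arithmetic check on the reported figures (test mean 3435.5).
first_model = 450.75 / 3435.5    # about 0.13
second_model = 204.16 / 3435.5   # about 0.059
print(round(first_model, 3), round(second_model, 3))
```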

How to improve

Add more training data

12 days of training data is not enough to take more complex seasonality (such as a weekly pattern) into account. Adding all of the January data may not be a good idea, because the first few days of January follow more of a holiday schedule than the rest of the month. So I added the Feb 2015 data to the training set (2016 Feb 01 12 am - 2016 Feb 12 10 am).
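One way to assemble such a training set is sketched below: split the Feb 2016 hourly series at the cutoff, keep the next 12 hours as the test set, and concatenate the Feb 2015 series in front. Plain concatenation is my assumption about how the two periods were combined, `split_at` and `combine` are hypothetical helpers, and the series values are placeholders.

```python
import pandas as pd

def split_at(series, cutoff, horizon=12):
    """Train on everything up to `cutoff`; test on the next `horizon` hours."""
    cutoff = pd.Timestamp(cutoff)
    end = cutoff + pd.Timedelta(hours=horizon)
    train = series[series.index <= cutoff]
    test = series[(series.index > cutoff) & (series.index <= end)]
    return train, test

def combine(*parts):
    """Concatenate hourly series from different periods in time order."""
    return pd.concat(parts).sort_index()

hour = pd.Timedelta(hours=1)
idx15 = pd.date_range("2015-02-01 00:00", "2015-02-28 23:00", freq=hour)
idx16 = pd.date_range("2016-02-01 00:00", "2016-02-12 23:00", freq=hour)
feb15 = pd.Series(range(len(idx15)), index=idx15)  # placeholder trip counts
feb16 = pd.Series(range(len(idx16)), index=idx16)  # placeholder trip counts

train16, test = split_at(feb16, "2016-02-12 10:00")
train = combine(feb15, train16)
print(len(train), len(test))
```

Note that the combined index has a gap of almost a year between the two periods, which a model that expects a regular frequency may need to be told about or have handled separately.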

Next Step

SARIMA does not support multiple seasonality modeling.

TBATS models (exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) are another option, which can handle complex seasonality given sufficient training data. However, a TBATS model takes much longer to train.

Summary

If I were a taxi driver, the evenings, when most people are off work, would be the busiest time of my day. The trips are most likely to be short, which means I would probably just need to drive around the downtown area. For the rest of the day, the big stations are where I should go to pick up new passengers.
A sad fact is that most passengers don't tip at all :( People who travel in the early mornings tip the most.